Predicting Brain Morphogenesis via Physics-Transfer Learning

Zhao, Yingjie, Song, Yicheng, Xu, Fan, Xu, Zhiping

arXiv.org Artificial Intelligence

Brain morphology is shaped by genetic and mechanical factors and is linked to biological development and diseases. Its fractal-like features, regional anisotropy, and complex curvature distributions hinder quantitative insights in medical inspections. Recognizing that the underlying elastic instability and bifurcation share the same physics as simple geometries such as spheres and ellipses, we developed a physics-transfer learning framework to address the geometrical complexity. To overcome the challenge of data scarcity, we constructed a digital library of high-fidelity continuum mechanics modeling that both describes and predicts the developmental processes of brain growth and disease. The physics of nonlinear elasticity from simple geometries is embedded into a neural network and applied to brain models. This physics-transfer approach demonstrates remarkable performance in feature characterization and morphogenesis prediction, highlighting the pivotal role of localized deformation in dominating over the background geometry. The data-driven framework also provides a library of reduced-dimensional evolutionary representations that capture the essential physics of the highly folded cerebral cortex. Validation through medical images and domain expertise underscores the deployment of digital-twin technology in comprehending the morphological complexity of the brain.


MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts

Piryani, Bhawna, Mozafari, Jamshid, Abdallah, Abdelrahman, Doucet, Antoine, Jatowt, Adam

arXiv.org Artificial Intelligence

Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of the text, including character insertion, deletion and permutation -- can significantly impact downstream tasks like question-answering (QA). In this work, we introduce a multilingual QA dataset MultiOCR-QA, designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR-ed old documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA on various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
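The character-level error types named in the abstract (insertion, deletion, and character corruption) can be simulated with a small noise injector. The sketch below is purely illustrative and is not the noise model used to build MultiOCR-QA; real OCR noise follows character confusion statistics rather than uniform randomness:

```python
import random

def inject_ocr_noise(text: str, error_rate: float = 0.05, seed: int = 0) -> str:
    """Simulate character-level OCR errors: deletions, substitutions,
    and insertions, applied independently per character."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        r = rng.random()
        if r < error_rate / 3:            # deletion: drop the character
            continue
        elif r < 2 * error_rate / 3:      # substitution: replace with a random letter
            out.append(rng.choice(letters))
        elif r < error_rate:              # insertion: keep char, add a stray one
            out.append(ch)
            out.append(rng.choice(letters))
        else:                             # character survives intact
            out.append(ch)
    return "".join(out)

clean = "The quick brown fox jumps over the lazy dog."
noisy = inject_ocr_noise(clean, error_rate=0.15)
```

Sweeping `error_rate` produces a family of progressively noisier corpora, which is one simple way to probe QA robustness at "various levels" of degradation.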


Neural Network Modeling of Microstructure Complexity Using Digital Libraries

Zhao, Yingjie, Xu, Zhiping

arXiv.org Artificial Intelligence

Microstructure evolution in matter is often modeled numerically using field or level-set solvers, mirroring the dual representation of spatiotemporal complexity in terms of pixel or voxel data, and geometrical forms in vector graphics. Motivated by this analog, as well as the structural and event-driven nature of artificial and spiking neural networks, respectively, we evaluate their performance in learning and predicting fatigue crack growth and Turing pattern development. Predictions are made based on digital libraries constructed from computer simulations, which can be replaced by experimental data to lift the mathematical overconstraints of physics. Our assessment suggests that the leaky integrate-and-fire neuron model offers superior predictive accuracy with fewer parameters and less memory usage, alleviating the accuracy-cost tradeoff in contrast to the common practices in computer vision tasks. Examination of network architectures shows that these benefits arise from its reduced weight range and sparser connections. The study highlights the capability of event-driven models in tackling problems with evolutionary bulk-phase and interface behaviors using the digital library approach.
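The leaky integrate-and-fire neuron mentioned above can be stated in a few lines: the membrane potential decays toward rest with time constant tau, integrates input current, and emits a spike when it crosses a threshold, after which it resets. This is a generic Euler-discretized sketch with illustrative parameters, not the specific configuration used in the paper:

```python
def lif_simulate(inputs, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Leaky integrate-and-fire neuron: returns a 0/1 spike train,
    one entry per input time step."""
    v = v_reset
    spikes = []
    for i in inputs:
        # Euler step of the membrane equation dv/dt = (-v + i) / tau
        v += dt * (-v + i) / tau
        if v >= v_th:       # threshold crossed: fire and reset
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes

# Strong constant drive fires on every step; zero drive never fires.
dense = lif_simulate([2.0] * 10, tau=2.0)
silent = lif_simulate([0.0] * 10)
```

The event-driven character the paper highlights comes from this thresholding: between spikes the neuron carries only a single scalar state, which is one intuition for the reduced memory footprint relative to dense activations.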


Dynamic faceted search: from haystack to highlight

AIHub

In the digital age, the number of scholarly articles is growing exponentially. In the Open Research Knowledge Graph's question-answering facility ASK, for example, more than 80 million research articles have already been indexed. Finding the most relevant information in vast collections of scholarly data can be daunting for researchers, students, and academics. To tackle this challenge, search engines and digital libraries often rely on advanced search techniques, one of the most effective being faceted search. Faceted search is an advanced search method that allows users to filter and refine search results based on multiple predefined attributes, known as facets.
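At its core, faceted filtering as described above reduces to conjunctive attribute matching, usually paired with per-facet counts so the interface can show how many results each refinement would leave. A minimal sketch (the record fields are invented for illustration and do not reflect ASK's actual schema):

```python
from collections import Counter

def faceted_search(records, **facets):
    """Keep only records matching every requested facet exactly."""
    return [r for r in records
            if all(r.get(k) == v for k, v in facets.items())]

def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records)

papers = [
    {"title": "A", "year": 2023, "field": "NLP"},
    {"title": "B", "year": 2023, "field": "Vision"},
    {"title": "C", "year": 2022, "field": "NLP"},
]

hits = faceted_search(papers, year=2023, field="NLP")   # only paper "A"
counts = facet_counts(papers, "field")                  # {"NLP": 2, "Vision": 1}
```

Production systems compute these counts in the search index (e.g., as inverted-index aggregations) rather than by scanning records, but the filtering semantics are the same.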


A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain

Kroll, Hermann, Sackhoff, Pascal, Thang, Bill Matthias, Ksouri, Maha, Balke, Wolf-Tilo

arXiv.org Artificial Intelligence

Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantic enrichment of the documents, or implementing novel access paths. All of these applications require some text processing, either to identify relevant entities, extract semantic relationships between them, or to classify documents into some categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted, and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmarks, we tackle the problem from the perspective of a digital library practitioner. In other words, we also consider tradeoffs between accuracy and application costs, and dive into training data.

One way to explore a digital library's content is to apply natural language processing methods, e.g., identify central entities (e.g., the Person Albert Einstein), their relationships (e.g., Albert Einstein was born in Ulm), and classify documents as belonging to classes (e.g., descriptive articles). The extraction of semantic relationships between named entities is already used in several digital library projects for different purposes, e.g., constructing a biomedical knowledge graph from scientific papers like SemMedDB [18], harvesting leader boards of how computer science methods perform on benchmarks [17], harvesting scientific information as done in SciGraph [44], enabling graph-based discovery systems in digital libraries [20], or enriching library content like newspapers as done in the Swiss-Luxembourgish impresso [10].


STONYBOOK: A System and Resource for Large-Scale Analysis of Novels

Pethe, Charuta, Kim, Allen, Prabhakar, Rajesh, Pial, Tanzir, Skiena, Steven

arXiv.org Artificial Intelligence

Books have historically been the primary mechanism through which narratives are transmitted. We have developed a collection of resources for the large-scale analysis of novels, including: (1) an open source end-to-end NLP analysis pipeline for the annotation of novels into a standard XML format, (2) a collection of 49,207 distinct cleaned and annotated novels, and (3) a database with an associated web interface for the large-scale aggregate analysis of these literary works. We describe the major functionalities provided in the annotation system along with their utilities. We present samples of analysis artifacts from our website, such as visualizations of character occurrences and interactions, similar books, representative vocabulary, part of speech statistics, and readability metrics. We also describe the use of the annotated format in qualitative and quantitative analysis across large corpora of novels.


Topological Data Analysis in smart manufacturing processes -- A survey on the state of the art

Uray, Martin, Giunti, Barbara, Kerber, Michael, Huber, Stefan

arXiv.org Artificial Intelligence

Topological Data Analysis (TDA) is a mathematical method using techniques from topology for the analysis of complex, multi-dimensional data that has been widely and successfully applied in several fields such as medicine, material science, biology, and others. This survey summarizes the state of the art of TDA in yet another application area: industrial manufacturing and production in the context of Industry 4.0. We perform a rigorous and reproducible literature search of applications of TDA on the setting of industrial production and manufacturing. The resulting works are clustered and analyzed based on their application area within the manufacturing process and their input data type. We highlight the key benefits of TDA and their tools in this area and describe its challenges, as well as future potential. Finally, we discuss which TDA methods are underutilized in (the specific area of) industry and the identified types of application, with the goal of prompting more research in this profitable area of application.


ACM: Digital Library: Communications of the ACM

#artificialintelligence

Forecasting rates of sea level change in polar ice shelves: Polar scientists, along with atmospheric and ocean scientists, face an urgent need to understand sea level rise around the globe. Ice-shelf environments represent extreme environments for sampling and sensing. Current efforts to collect sensed data are limited and use tethered robots with traditional sampling frequency and collection limitations. The ability to collect extensive data about conditions at or near the ice shelves will inform our understanding about changes in ocean circulation patterns, as well as feedbacks with wind circulation. New research on intelligent sensors would support selective data collection, onboard data analysis, and adaptive sensor steering.


A Bayesian Learning, Greedy agglomerative clustering approach and evaluation techniques for Author Name Disambiguation Problem

Sourav, Shashwat

arXiv.org Artificial Intelligence

Author names often suffer from ambiguity owing to the same author appearing under different names and multiple authors possessing similar names. This creates difficulty in associating a scholarly work with the person who wrote it, thereby introducing inaccuracy in credit attribution, bibliometric analysis, search-by-author in a digital library, and expert discovery. A plethora of techniques for disambiguation of author names have been proposed in the literature. I focus on the research efforts targeted at disambiguating author names. I first go through the conventional methods, then I discuss evaluation techniques and the clustering model, which finally leads to the Bayesian learning and greedy agglomerative approach. I believe this concentrated review will be useful for the research community because it discusses techniques applied to a very large real database that is actively used worldwide. The Bayesian and greedy agglomerative approaches discussed will help to tackle author name disambiguation (AND) problems in a better way. Finally, I try to outline a few directions for future work.
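The greedy agglomerative idea can be sketched in a few lines: each author mention starts as its own cluster, and the most similar pair of clusters is merged repeatedly until no pair exceeds a threshold. The similarity signal below (Jaccard overlap of coauthor sets) is an assumed stand-in for illustration, not necessarily the evidence model used in the reviewed work:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def greedy_agglomerate(mentions, threshold=0.3):
    """Greedily merge author-mention clusters whose coauthor sets
    overlap above `threshold`; returns lists of mention indices."""
    clusters = [{"ids": [i], "coauthors": set(m)} for i, m in enumerate(mentions)]
    while True:
        best, pair = threshold, None
        # Find the single most similar cluster pair above threshold.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = jaccard(clusters[i]["coauthors"], clusters[j]["coauthors"])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            return [c["ids"] for c in clusters]
        i, j = pair
        clusters[i]["ids"] += clusters[j]["ids"]
        clusters[i]["coauthors"] |= clusters[j]["coauthors"]
        del clusters[j]

# Mentions 0 and 1 share coauthor "smith", so they merge into one identity.
result = greedy_agglomerate([{"smith", "lee"}, {"smith", "kim"}, {"garcia"}])
# → [[0, 1], [2]]
```

A Bayesian variant replaces the fixed threshold with a posterior odds test of "same author" versus "different authors" given the shared evidence, which is the combination the review builds toward.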


National Digital Library of India

Communications of the ACM

The National Digital Library of India was conceptualized with an aim to bring equity of access to educational resources for every Indian through a single window access mechanism.